Runbook: Node Not Ready
Alert
- Prometheus Alert: KubeNodeNotReady / KubeNodeUnreachable
- Grafana Dashboard: Cluster Health dashboard
- Firing condition: A Kubernetes node has been in NotReady state for more than 5 minutes
Severity
Critical -- A NotReady node means pods scheduled on that node may be unreachable or evicted. In a 3-node cluster (1 server + 2 agents), losing a node reduces capacity by 33-50% and may affect pod scheduling and HA guarantees.
Impact
- Pods on the affected node become unreachable
- DaemonSet pods (Alloy, NeuVector enforcer, node-exporter) stop reporting from that node
- Pod disruption budgets may prevent rescheduling if capacity is tight
- If the affected node is the RKE2 server (control plane), the Kubernetes API may become unavailable
- NeuVector enforcer loses runtime visibility on the affected node
Investigation Steps
- Check node status:
kubectl get nodes -o wide
- Describe the not-ready node for condition details:
kubectl describe node <node-name>
- Look at the conditions section for specific failures:
kubectl get node <node-name> -o json | jq '.status.conditions'
- Check if the node is reachable via SSH:
ssh sre-admin@<node-ip> "uptime && free -h && df -h"
- If SSH is available, check kubelet status:
ssh sre-admin@<node-ip> "sudo systemctl status rke2-agent"
# Or for server nodes:
ssh sre-admin@<node-ip> "sudo systemctl status rke2-server"
- Check RKE2 service and kubelet logs on the node:
ssh sre-admin@<node-ip> "sudo journalctl -u rke2-agent --no-pager --since '30 minutes ago' | tail -100"
ssh sre-admin@<node-ip> "sudo tail -100 /var/lib/rancher/rke2/agent/logs/kubelet.log"
- Check for disk pressure:
ssh sre-admin@<node-ip> "df -h && df -i"
- Check for memory pressure:
ssh sre-admin@<node-ip> "free -h && grep -E 'MemTotal|MemAvailable|SwapTotal' /proc/meminfo"
- Check for PID pressure (compare the process count against the kernel limit):
ssh sre-admin@<node-ip> "ps aux | wc -l && cat /proc/sys/kernel/pid_max"
- Check containerd (RKE2 runs containerd as a child process, not a standalone systemd unit):
ssh sre-admin@<node-ip> "pgrep -fa containerd"
ssh sre-admin@<node-ip> "sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps"
- Check system logs for hardware or kernel errors:
ssh sre-admin@<node-ip> "sudo dmesg | tail -50"
ssh sre-admin@<node-ip> "sudo journalctl -p err --since '1 hour ago' --no-pager"
- Check pods that were running on the not-ready node:
kubectl get pods -A --field-selector spec.nodeName=<node-name>
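The condition checks above can be wrapped in a small helper that flags only problem conditions. This is a sketch; the sample input is a simplified Type/Status listing (not raw `kubectl describe` output), and `summarize_conditions` is a hypothetical name:

```shell
# Sketch: flag problem conditions from a "Type Status" listing.
# In practice, extract the Type/Status columns from `kubectl describe node`
# and pipe them in; the here-doc below is a hypothetical sample.
summarize_conditions() {
  # Ready should be True; the pressure conditions should be False.
  awk '
    $1 == "Ready"              && $2 != "True" { print "PROBLEM: node not Ready (status=" $2 ")" }
    $1 ~ /Pressure$/           && $2 == "True" { print "PROBLEM: " $1 " is True" }
    $1 == "NetworkUnavailable" && $2 == "True" { print "PROBLEM: NetworkUnavailable is True" }
  '
}

summarize_conditions <<'EOF'
MemoryPressure False
DiskPressure True
PIDPressure False
Ready False
EOF
# Prints:
#   PROBLEM: DiskPressure is True
#   PROBLEM: node not Ready (status=False)
```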
Resolution
kubelet/RKE2 service stopped
- Restart the RKE2 service:
# For agent nodes:
ssh sre-admin@<node-ip> "sudo systemctl restart rke2-agent"
# For server nodes:
ssh sre-admin@<node-ip> "sudo systemctl restart rke2-server"
- Wait 1-2 minutes and verify the node returns to Ready:
kubectl get node <node-name> -w
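As an alternative to watching interactively, the verification step can be scripted. This is a sketch: `node_is_ready` parses the STATUS column of `kubectl get nodes --no-headers` output, and note that real statuses can carry suffixes such as `Ready,SchedulingDisabled`:

```shell
# Sketch: decide readiness from one line of `kubectl get nodes --no-headers`.
node_is_ready() {
  # $1: a "NAME STATUS ROLES AGE VERSION" line
  [ "$(echo "$1" | awk '{print $2}')" = "Ready" ]
}

# Against a real cluster (uncomment), poll for ~2 minutes:
# for i in $(seq 1 24); do
#   line=$(kubectl get node <node-name> --no-headers)
#   node_is_ready "$line" && { echo "node is Ready"; break; }
#   sleep 5
# done

node_is_ready "agent-1 Ready <none> 42d v1.28.9+rke2r1" && echo yes   # -> yes
node_is_ready "agent-2 NotReady <none> 42d v1.28.9+rke2r1" || echo no # -> no
```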
Disk pressure
- Identify large files or directories:
ssh sre-admin@<node-ip> "sudo du -sh /var/log/* | sort -rh | head -10"
ssh sre-admin@<node-ip> "sudo du -sh /var/lib/rancher/rke2/* | sort -rh | head -10"
- Clean up container images:
ssh sre-admin@<node-ip> "sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock rmi --prune"
- Rotate and compress old logs:
ssh sre-admin@<node-ip> "sudo journalctl --vacuum-size=500M"
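To tie the disk checks to the 80%/90% thresholds listed under Prevention, a small filter over `df` output can highlight only the filesystems that matter. A sketch, with a hypothetical sample; feed it real `df -h --output=pcent,target` output instead:

```shell
# Sketch: classify filesystems against the 80% (warning) / 90% (critical)
# disk thresholds. Expects "Use% Mounted" columns, header on line 1.
classify_usage() {
  awk 'NR > 1 {
    pct = $1 + 0                        # "92%" -> 92 (numeric prefix)
    if (pct >= 90)      print "CRITICAL " $2 " at " $1
    else if (pct >= 80) print "WARNING " $2 " at " $1
  }'
}

classify_usage <<'EOF'
Use% Mounted
92% /var/lib/rancher
81% /var
40% /
EOF
# Prints:
#   CRITICAL /var/lib/rancher at 92%
#   WARNING /var at 81%
```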
Memory pressure
- Check for pods consuming excessive memory:
kubectl top pods -A --sort-by=memory | head -20
- If a specific pod is the cause, check its memory limits and consider adjusting the HelmRelease values
- If the memory pressure is system-level, check for non-Kubernetes processes:
ssh sre-admin@<node-ip> "ps aux --sort=-%mem | head -20"
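The kubelet sets the MemoryPressure condition based on available memory, so it helps to express the numbers as a percentage. A sketch computing available-memory percent from `/proc/meminfo` fields; the values in the here-doc are hypothetical (pipe in the real file on the node):

```shell
# Sketch: available memory as a percentage of total, from /proc/meminfo.
# Usage on a node: cat /proc/meminfo | mem_available_pct
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }
  '
}

mem_available_pct <<'EOF'
MemTotal:       8000000 kB
MemAvailable:    400000 kB
EOF
# Prints: 5
```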
Network connectivity issues
- Check if the node can reach the API server:
ssh sre-admin@<node-ip> "curl -k https://127.0.0.1:6443/healthz"
- Check firewall rules:
ssh sre-admin@<node-ip> "sudo firewall-cmd --list-all"
- Verify required ports are open (RKE2 uses 6443, 9345, 10250, and 2379-2380; Canal, the default RKE2 CNI, also needs 8472/UDP for VXLAN)
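The port checks can be run from another node with bash's built-in `/dev/tcp` redirect (no extra tooling needed). A sketch; `<node-ip>` is a placeholder:

```shell
# Sketch: TCP reachability check using bash's /dev/tcp, 2s timeout per port.
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c ">/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open $host:$port"
  else
    echo "closed $host:$port"
  fi
}

# Against the affected node (uncomment):
# for p in 6443 9345 10250 2379 2380; do check_port <node-ip> "$p"; done
```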
Node completely unresponsive
- If SSH is not available, attempt console access via Proxmox:
# From a machine with Proxmox access
ssh root@<proxmox-host> "qm status <vmid>"
- If the VM is stopped, start it:
ssh root@<proxmox-host> "qm start <vmid>"
- If the VM is running but unresponsive, force reset:
ssh root@<proxmox-host> "qm reset <vmid>"
- After the node comes back, verify it rejoins the cluster:
kubectl get nodes -w
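The Proxmox decision above (stopped → start, running-but-unresponsive → reset) can be captured as a small mapping. A sketch: the `status: ...` output format is an assumption, so verify it against your Proxmox version before relying on it:

```shell
# Sketch: pick a recovery action from `qm status <vmid>` output.
# Assumes the output looks like "status: running" or "status: stopped".
recovery_action() {
  case "$1" in
    "status: stopped") echo "qm start <vmid>" ;;
    "status: running") echo "VM up but node unresponsive -> qm reset <vmid>" ;;
    *)                 echo "unexpected status: $1 -- investigate manually" ;;
  esac
}

recovery_action "status: stopped"   # -> qm start <vmid>
```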
Cordon and drain (if node needs maintenance)
- Cordon the node to prevent new pods:
kubectl cordon <node-name>
- Drain existing pods:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=300s
- Perform maintenance
- Uncordon when ready:
kubectl uncordon <node-name>
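The cordon/drain steps above can be wrapped with a dry-run mode so the sequence can be rehearsed before touching the node. A sketch; the `DRY_RUN` convention and default node name are illustrative:

```shell
#!/usr/bin/env bash
# Sketch: cordon/drain wrapper with dry-run (default). Set DRY_RUN=0 to execute.
set -euo pipefail

NODE="${1:-node-name}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s
echo "perform maintenance, then: kubectl uncordon $NODE"
```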
Prevention
- Monitor node conditions via the kube_node_status_condition metric in Prometheus
- Set disk usage alerts at 80% (warning) and 90% (critical)
- Set memory usage alerts at 85% (warning) and 95% (critical)
- Configure log rotation on all nodes via Ansible (/etc/logrotate.d/)
- Ensure the RKE2 service is enabled on boot: systemctl enable rke2-agent (or systemctl enable rke2-server on server nodes)
- Maintain at least 3 worker nodes for pod scheduling redundancy
- Run periodic CIS benchmark scans via NeuVector to catch drift
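The disk thresholds above can be expressed as a Prometheus alerting rule against node-exporter metrics. A sketch of the warning tier only; the group/alert names and label filters are illustrative, not existing config:

```yaml
# Sketch: 80% disk-usage warning from node-exporter filesystem metrics.
groups:
  - name: node-capacity
    rules:
      - alert: NodeDiskUsageWarning
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```

A second rule with `> 90` and `severity: critical` covers the critical tier.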
Escalation
- If the RKE2 server (control plane) node is not ready: this is a P1 incident -- the Kubernetes API may be unavailable
- If multiple nodes are not ready simultaneously: investigate shared infrastructure (network switch, storage, hypervisor)
- If the node cannot rejoin the cluster after restart: the node may need to be re-provisioned using Ansible